
    Developing an Assessment Checklist

    The goal of this project was to prepare an assessment checklist for the Information Retrieval (IR) course at the Department of Computer Science in the academic year 2019/2020. The project was motivated by observations from the previous edition of the course: in 2018/2019 there was a clear mismatch between students' and teachers' expectations regarding the assignment. Students struggled to understand how to structure and write a good-quality assignment. Furthermore, even though they were given guidelines on how to give feedback, they also struggled to provide useful feedback to their peers. With the proposed assessment checklist, we aimed to guide and help students in structuring their assignments and peer reviews. This paper is organised as follows: Section 1 describes the course during which the project was carried out; Section 2 presents the project goals and motivations; Section 3 describes in detail how the project was conducted; Section 4 reports some analysis of the project results; and Section 5 presents conclusions and future challenges.

    Exploiting user signals and stochastic models to improve information retrieval systems and evaluation

    The leitmotiv throughout this thesis is IR evaluation. We discuss different issues related to effectiveness measures and propose novel solutions to address these challenges. We start by providing a formal definition of utility-oriented measurement of retrieval effectiveness, based on the representational theory of measurement. The proposed theoretical framework contributes to a better understanding of the problem's complexities, separating those due to the inherent difficulty of comparing systems from those due to the expected numerical properties of measures. We then propose AWARE, a probabilistic framework for dealing with the noise and inconsistencies introduced when relevance labels are gathered from multiple crowd assessors. By modelling relevance judgements and crowd assessors as sources of uncertainty, we directly combine the performance measures computed on the ground truth generated by each crowd assessor, instead of adopting a classification technique to merge the labels at pool level. Finally, we investigate evaluation measures able to account for user signals. We propose a new user model based on Markov chains, which allows the user to scan the result list with many degrees of freedom. We exploit this Markovian model to inject user models into precision, defining a new family of evaluation measures, and we embed this model as the objective function of a Learning to Rank (LtR) algorithm to improve system performance.
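    A minimal sketch of the AWARE idea, assuming binary relevance; the helper names (average_precision, aware_score) and the uniform weighting are illustrative assumptions, not the thesis's actual formulation. The point is that each assessor's judgements yield their own measure score, and the scores, rather than the labels, are combined:

    ```python
    # Combine per-assessor AP scores instead of merging crowd labels first.
    import numpy as np

    def average_precision(ranked_docs, relevant):
        """Plain AP of one ranked list against one assessor's relevant set."""
        hits, precisions = 0, []
        for k, doc in enumerate(ranked_docs, start=1):
            if doc in relevant:
                hits += 1
                precisions.append(hits / k)
        return float(np.mean(precisions)) if precisions else 0.0

    def aware_score(ranked_docs, assessor_judgements, weights=None):
        """Weighted combination of per-assessor scores (uniform by default)."""
        scores = [average_precision(ranked_docs, rel) for rel in assessor_judgements]
        weights = weights or [1 / len(scores)] * len(scores)
        return sum(w * s for w, s in zip(weights, scores))

    # Three crowd assessors, each with their own ground truth for one topic.
    run = ["d3", "d1", "d7", "d2"]
    judgements = [{"d3", "d2"}, {"d1"}, {"d3", "d7", "d2"}]
    print(aware_score(run, judgements))
    ```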

    Learning Recommendations from User Actions in the Item-poor Insurance Domain

    While personalised recommendations are successful in domains like retail, where large volumes of user feedback on items are available, generating automatic recommendations in data-sparse domains, like insurance purchasing, is an open problem. The insurance domain is notoriously data-sparse because the number of products is typically low (compared to retail) and they are usually purchased to last for a long time. Also, many users still prefer the telephone over the web for purchasing products, reducing the amount of web-logged user interactions. To address this, we present a recurrent neural network recommendation model that uses past user sessions as signals for learning recommendations. Learning from past user sessions allows dealing with the data scarcity of the insurance domain. Specifically, our model learns from several types of user actions that are not always associated with items, and, unlike all prior session-based recommendation models, it models relationships between input sessions and a target action (purchasing insurance) that does not take place within the input sessions. Evaluation on a real-world dataset from the insurance domain (ca. 44K users, 16 items, 54K purchases, and 117K sessions) against several state-of-the-art baselines shows that our model notably outperforms the baselines. Ablation analysis shows that this is mainly due to the learning of dependencies across sessions in our model. We contribute the first session-based model for insurance recommendation, and make our dataset available to the research community.
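    A minimal PyTorch sketch of the kind of model described above: a GRU reads a sequence of integer-coded user actions (not necessarily tied to items) and scores the catalogue of insurance products for a purchase that happens outside the input sessions. The vocabulary size, dimensions, and single-layer architecture are illustrative assumptions, not the paper's exact configuration.

    ```python
    import torch
    import torch.nn as nn

    class SessionRecommender(nn.Module):
        def __init__(self, n_action_types=50, n_items=16, emb_dim=32, hidden=64):
            super().__init__()
            self.action_emb = nn.Embedding(n_action_types, emb_dim)
            self.gru = nn.GRU(emb_dim, hidden, batch_first=True)
            self.item_head = nn.Linear(hidden, n_items)  # one score per product

        def forward(self, action_ids):
            # action_ids: (batch, seq_len) integer-coded user actions
            x = self.action_emb(action_ids)
            _, h = self.gru(x)                    # h: (1, batch, hidden)
            return self.item_head(h.squeeze(0))   # (batch, n_items) purchase logits

    model = SessionRecommender()
    sessions = torch.randint(0, 50, (8, 20))      # 8 sessions of 20 actions each
    logits = model(sessions)                      # rank the 16 products per user
    ```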

    Improving Information Retrieval Evaluation via Markovian User Models and Visual Analytics

    To address the challenge of adapting experimental evaluation to constantly evolving user tasks and needs, we develop a new family of Markovian Information Retrieval (IR) evaluation measures, called Markov Precision (MP), in which the interaction between the user and the ranked result list is modelled via Markov chains, and which will be able to explicitly link lab-style and online evaluation methods. Moreover, since experimental results are often not easy to understand, we will develop a Web-based Visual Analytics (VA) prototype in which an animated state diagram of the Markov chain explains how the user interacts with the ranked result list, in order to support careful failure analysis.
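    As an illustration of the idea (not MP's actual definition, which is calibrated on user behaviour), one can model the scan as a Markov chain over rank positions and let its stationary distribution weight the precision contribution of each relevant document:

    ```python
    import numpy as np

    def stationary_distribution(P):
        """Left eigenvector of P for eigenvalue 1, normalised to sum to 1."""
        vals, vecs = np.linalg.eig(P.T)
        pi = np.real(vecs[:, np.argmin(np.abs(vals - 1))])
        return pi / pi.sum()

    def markov_precision(rels, P):
        """Weight precision-at-k by how much time the user spends at rank k."""
        pi = stationary_distribution(P)
        prec_at_k = np.cumsum(rels) / np.arange(1, len(rels) + 1)
        return float(np.sum(pi * prec_at_k * rels))

    rels = np.array([1, 0, 1, 1, 0])      # binary relevance down the ranking
    n = len(rels)
    P = np.full((n, n), 0.05)             # small chance of jumping to any rank
    for i in range(n - 1):
        P[i, i + 1] += 1 - P[i].sum()     # mostly move on to the next rank
    P[-1, -1] += 1 - P[-1].sum()          # linger at the bottom of the list
    print(markov_precision(rels, P))
    ```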

    Evaluation Measures of Individual Item Fairness for Recommender Systems: A Critical Study

    Fairness is an emerging and challenging topic in recommender systems. In recent years, various ways of evaluating and therefore improving fairness have emerged. In this study, we examine existing evaluation measures of fairness in recommender systems. Specifically, we focus solely on exposure-based fairness measures of individual items that aim to quantify the disparity in how individual items are recommended to users, separate from item relevance to users. We gather all such measures and we critically analyse their theoretical properties. We identify a series of limitations in each of them, which collectively may render the affected measures hard or impossible to interpret, to compute, or to use for comparing recommendations. We resolve these limitations by redefining or correcting the affected measures, or we argue why certain limitations cannot be resolved. We further perform a comprehensive empirical analysis of both the original and our corrected versions of these fairness measures, using real-world and synthetic datasets. Our analysis provides novel insights into the relationship between measures based on different fairness concepts, and different levels of measure sensitivity and strictness. We conclude with practical suggestions of which fairness measures should be used and when. Our code is publicly available. To our knowledge, this is the first critical comparison of individual item fairness measures in recommender systems.
    Comment: Accepted to ACM Transactions on Recommender Systems (TORS).
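    A minimal sketch of one exposure-based notion of individual item fairness studied in this line of work: accumulate position-discounted exposure per item over all users' lists and summarise the disparity, here with a Gini coefficient. The log discount and the Gini choice are illustrative assumptions; the paper analyses a whole family of such measures.

    ```python
    import numpy as np

    def item_exposure(rankings, n_items):
        """Total position-discounted exposure of each item across all lists."""
        exposure = np.zeros(n_items)
        for ranking in rankings:
            for pos, item in enumerate(ranking, start=1):
                exposure[item] += 1.0 / np.log2(pos + 1)   # DCG-style discount
        return exposure

    def gini(x):
        """0 = exposure spread perfectly evenly, 1 = all exposure on one item."""
        x = np.sort(x)
        n = len(x)
        return float((2 * np.arange(1, n + 1) - n - 1) @ x / (n * x.sum()))

    rankings = [[0, 2, 1], [0, 1, 3], [0, 3, 2]]   # top-3 lists for three users
    print(gini(item_exposure(rankings, n_items=4)))
    ```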

    University of Copenhagen Participation in TREC Health Misinformation Track 2020

    In this paper, we describe our participation in the TREC Health Misinformation Track 2020. We submitted 11 runs to the Total Recall Task and 13 runs to the Ad Hoc Task. Our approach consists of three steps: (1) we create an initial run with BM25 and RM3; (2) we estimate credibility and misinformation scores for the documents in the initial run; (3) we merge the relevance, credibility, and misinformation scores to re-rank the documents in the initial run. To estimate credibility scores, we implement a classifier which exploits features based on the content and the popularity of a document. To compute the misinformation score, we apply a stance detection approach with a pretrained Transformer language model. Finally, we use different approaches to merge scores: a weighted average, the distance between score vectors, and rank fusion.
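    A minimal sketch of the weighted-average variant of step (3): min-max normalise the three scores to a common scale, then re-rank by their weighted sum. The weights and the sign convention (a higher misinformation score meaning a more trustworthy stance) are illustrative assumptions, not the submitted runs' exact settings.

    ```python
    def minmax(xs):
        """Rescale a list of scores to [0, 1] so they can be averaged."""
        lo, hi = min(xs), max(xs)
        return [(x - lo) / (hi - lo) if hi > lo else 0.0 for x in xs]

    def rerank(doc_ids, rel, cred, mis, w=(0.6, 0.2, 0.2)):
        rel, cred, mis = minmax(rel), minmax(cred), minmax(mis)
        fused = [w[0] * r + w[1] * c + w[2] * m for r, c, m in zip(rel, cred, mis)]
        return sorted(zip(doc_ids, fused), key=lambda p: p[1], reverse=True)

    # BM25+RM3 relevance, classifier credibility, stance-based misinformation.
    ids = ["d1", "d2", "d3"]
    print(rerank(ids, rel=[12.3, 14.1, 11.8], cred=[0.9, 0.2, 0.8], mis=[0.4, 0.1, 0.9]))
    ```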

    Basis of a Formal Framework for Information Retrieval Evaluation Measurements

    In this paper we present a formal framework based on the representational theory of measurement, and we define and study the properties of utility-oriented measurements of retrieval effectiveness such as AP, RBP, ERR, and many other popular IR evaluation measures.
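    For reference, the standard definitions of the three measures named above, each a utility-oriented measurement in the sense of the framework; here rel_k is the binary relevance at rank k, R the number of relevant documents, N the ranking length, p the RBP persistence parameter, and g_i the graded relevance at rank i:

    ```latex
    \[
    \mathrm{AP} = \frac{1}{R} \sum_{k=1}^{N} \mathrm{rel}_k \cdot \frac{1}{k} \sum_{i=1}^{k} \mathrm{rel}_i
    \qquad
    \mathrm{RBP} = (1 - p) \sum_{k=1}^{N} p^{k-1}\, \mathrm{rel}_k
    \]
    \[
    \mathrm{ERR} = \sum_{k=1}^{N} \frac{1}{k}\, R_k \prod_{i=1}^{k-1} (1 - R_i),
    \qquad
    R_i = \frac{2^{g_i} - 1}{2^{g_{\max}}}
    \]
    ```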

    The Importance of the Sugar Cane and Alcohol Sector and Its Relations with the Productive Structure of the Economy

    In an economic context in which the state is reducing its role in the economy, the agents involved with the Sugar Cane and Alcohol sector, usually highly dependent on government policies, have to change their behavior so they can operate in a competitive market without benefits from the state. Therefore, an analysis of the economic relationships between this sector and the economic structure of Brazil would help to define how this sector could adapt to the new economic conditions. As such, the goals of this paper are to identify: i) the importance, in terms of backward and forward linkages, of the Sugar Cane and Alcohol sector in the economy, using Hirschman/Rasmussen (HR) indexes and Pure Linkage indexes; ii) how changes in the use coefficients of Sugar Cane and Alcohol products, by the sectors that use them as inputs, would spread throughout the economy, using the Field of Influence approach; and iii) the major relationships in the economy. The data used in this paper refer to the Brazilian input-output tables constructed for the years 1985, 1992, and 1995 at the level of 34 sectors. The major findings for the HR indexes show that the importance of the Sugar Cane and Alcohol sector, in terms of productive links, remained practically unchanged over the 1985/1995 period. The results for the Normalized Pure Linkages, which show the relative importance of the sector in terms of production generation, show that the sector improved its position from 1985 to 1992 and lost ground in the following period, from 1992 to 1995.
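    A minimal numpy sketch of the Hirschman/Rasmussen indexes used here: from a technical-coefficients matrix A, take the Leontief inverse B = (I - A)^{-1}; a sector's backward (forward) index compares its column (row) mean of B with the overall mean of B. The 3-sector matrix is a toy example, not the 34-sector Brazilian tables from the paper.

    ```python
    import numpy as np

    A = np.array([[0.10, 0.05, 0.20],     # toy technical coefficients
                  [0.15, 0.10, 0.05],
                  [0.05, 0.20, 0.10]])
    B = np.linalg.inv(np.eye(3) - A)      # Leontief inverse

    overall_mean = B.mean()
    backward = B.mean(axis=0) / overall_mean   # power of dispersion (columns)
    forward = B.mean(axis=1) / overall_mean    # sensitivity of dispersion (rows)

    # An index above 1 marks a sector with above-average linkages.
    print("backward:", backward.round(3))
    print("forward: ", forward.round(3))
    ```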

    Automated Medical Coding on MIMIC-III and MIMIC-IV: A Critical Review and Replicability Study

    Medical coding is the task of assigning medical codes to clinical free-text documentation. Healthcare professionals manually assign such codes to track patient diagnoses and treatments. Automated medical coding can considerably alleviate this administrative burden. In this paper, we reproduce, compare, and analyze state-of-the-art automated medical coding machine learning models. We show that several models underperform due to weak configurations, poorly sampled train-test splits, and insufficient evaluation. In previous work, the macro F1 score has been calculated sub-optimally, and our correction doubles it. We contribute a revised model comparison using stratified sampling and identical experimental setups, including hyperparameters and decision boundary tuning. We analyze prediction errors to validate and falsify assumptions of previous works. The analysis confirms that all models struggle with rare codes, while long documents only have a negligible impact. Finally, we present the first comprehensive results on the newly released MIMIC-IV dataset using the reproduced models. We release our code, model parameters, and new MIMIC-III and MIMIC-IV training and evaluation pipelines to accommodate fair future comparisons.
    Comment: 11 pages, 6 figures, to be published in Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '23), July 23-27, 2023, Taipei, Taiwan.
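    A hedged illustration of the macro-F1 pitfall: averaging per-code F1 over the full label space, including codes that never occur in the test split and thus contribute an F1 of 0, deflates the score, while averaging only over codes that actually occur yields a much higher value. That this is exactly the paper's correction is an assumption here.

    ```python
    import numpy as np
    from sklearn.metrics import f1_score

    rng = np.random.default_rng(0)
    n_codes = 50
    y_true = np.zeros((200, n_codes), dtype=int)
    y_true[:, :10] = rng.integers(0, 2, (200, 10))   # only 10 codes ever occur
    y_pred = y_true.copy()
    y_pred[rng.random(y_true.shape) < 0.2] ^= 1      # corrupt 20% of the labels

    all_codes = f1_score(y_true, y_pred, average="macro", zero_division=0)
    present = y_true.any(axis=0)
    seen_codes = f1_score(y_true[:, present], y_pred[:, present],
                          average="macro", zero_division=0)
    print(f"macro F1 over all {n_codes} codes: {all_codes:.3f}")
    print(f"macro F1 over {present.sum()} observed codes: {seen_codes:.3f}")
    ```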

    Graph-based Recommendation for Sparse and Heterogeneous User Interactions

    Recommender system research has often focused on approaches that operate on large-scale datasets containing millions of user interactions. However, many small businesses struggle to apply state-of-the-art models due to their very limited availability of data. We propose a graph-based recommender model which utilizes heterogeneous interactions between users and content of different types and is able to operate well on small-scale datasets. A genetic algorithm is used to find optimal weights that represent the strength of the relationship between users and content. Experiments on two real-world datasets (which we make available to the research community) show promising results (up to 7% improvement) in comparison with other state-of-the-art methods for low-data environments. These improvements are statistically significant and consistent across different data samples.
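    A minimal sketch of the weight search described above: a tiny genetic algorithm evolves one weight per interaction type (say view, click, purchase), scoring each candidate by recommendation quality on held-out data. The fitness function below is a stand-in for that evaluation; population size, mutation rate, and the three types are illustrative assumptions.

    ```python
    import numpy as np

    rng = np.random.default_rng(42)
    N_TYPES, POP, GENS = 3, 20, 30

    def fitness(weights):
        """Stand-in for validation recall of the graph recommender built
        with these edge-type weights; swap in the real evaluation."""
        target = np.array([0.1, 0.3, 0.6])    # pretend-optimal weighting
        return -np.sum((weights - target) ** 2)

    pop = rng.random((POP, N_TYPES))
    for _ in range(GENS):
        scores = np.array([fitness(w) for w in pop])
        parents = pop[np.argsort(scores)[-POP // 2:]]           # keep fittest half
        children = parents[rng.integers(0, len(parents), POP - len(parents))]
        children = np.clip(children + rng.normal(0, 0.05, children.shape), 0, 1)
        pop = np.vstack([parents, children])                    # mutate and refill

    best = pop[np.argmax([fitness(w) for w in pop])]
    print("best edge-type weights:", best.round(3))
    ```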